<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
  <meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8">
  <title>A gentle introduction to AMOS</title>
  <meta name="GENERATOR" content="OpenOffice.org 1.0.3  (Linux)">
  <meta name="CREATED" content="20030620;10312300">
  <meta name="CHANGED" content="20030703;12173400">

</head>
<body lang="de-DE" text="#000000">
<h1>A gentle introduction to AMOS</h1>
<p lang="en-US" class="western" style="margin-bottom: 0cm;"><br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
The world of Free/Libre Open
Source Software (FLOSS for short) has seen growth that would have been
unbelievable just a few years ago. When the popular FreshMeat.net
site was launched, there were just a few updates per day. Now
there are usually more than 50 new software announcements per day,
and this represents just a small percentage of the true number: just
consider SourceForge and its more than 60,000 projects, of which a
large number are at the "beta" stage or better, or all the
specialized software packages developed in the scientific and
research communities.</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
Wonderful as this growth is, simply finding what you are
looking for is becoming a problem in itself, not only for users but
especially for developers and system integrators who may be willing to
cooperate with an existing project rather than start something new
from scratch. The search is particularly hard when someone is looking
for specific capabilities that may be embedded in a package but are not
apparent from the documentation (which may be nonexistent to start
with). And if it is a problem now, imagine what will happen in a few
years if FLOSS production continues to grow at this
pace!</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
As members of the IST-sponsored
AMOS project, we are trying to devise a potential solution to the
search problem, through the development of a specialized search
engine devoted to searching software code and other code-related
artifacts (like code snippets, test cases and such). We are trying to
combine several technologies to improve on existing techniques:</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<ul>
  <li>
    <p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;"> we
are using a sophisticated, extremely efficient Prolog environment
(called Ciao Prolog, and itself released under the GPL) to be able to
perform complex manipulation of symbols,</p>
  </li>
  <li>
    <p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;"> the
search engine itself is using a structured approach to representing
software packages and their relationships, in the form of a very simple
"ontology" and a dictionary of potential search terms, and</p>
  </li>
  <li>
    <p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;"> the
algorithm used makes it possible to search for "assemblies", or sets of
packages that together try to match the user's requirements as closely
as possible.</p>
  </li>
</ul>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
We will try to explain everything
in a question-and-answer way. Stay with us!</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>So, you say that you are trying
to create some sort of search engine. But I can already search on
FreshMeat, or SourceForge, or Google! Why another one?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
Because searching isn't that
simple. Let me give a small example: suppose that you want to find a
way to create a UDF filesystem for pressing a DVD. Let us try to
find it with FreshMeat:</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<img src="amos-gentle_html_m2839c51.jpg" name="Graphic1" align="left"
 width="928" height="674" border="0"><br clear="left">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
But we want to create the image,
not read it. The others are not much better:</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<img src="amos-gentle_html_20f26717.jpg" name="Graphic2" align="left"
 width="928" height="674" border="0"><br clear="left">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<span style=""><span lang="en-US">And Google?</span></span></p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<img src="amos-gentle_html_m2ade7e6e.jpg" name="Graphic3" align="left"
 width="928" height="674" border="0"><br clear="left">
<br>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<span style=""><span lang="en-US">If
you know where to look, you can find the right project (it is on
SourceForge, and is called dvd-create). Of course, this is just one
example: if we look for a LaTeX-to-PDF converter instead,
FreshMeat returns sensible answers at the 15<sup>th</sup> and 16<sup>th</sup>
places, while Google fares much better. More complex requests are much
more difficult to search for, even more so in specialized areas where
words can have different meanings.</span></span>
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>How do you plan to do it
better?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
We try to improve the situation by
describing packages through their capabilities. That is, we are
introducing a set of words (called the "dictionary") and using those
to describe what each package does. The AMOS search engine is also
capable of combining packages; let me give another example:
suppose you want to find a way to convert LaTeX to PDF. You can
do it in one step, or by converting LaTeX to PostScript and then
PostScript to PDF, or by using other intermediate formats. AMOS tries to
find all suitable chains of packages, and creates "assemblies"
that implement (if possible) what is requested.</p>
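<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
To make the idea of chaining packages into assemblies more concrete, here
is a minimal sketch in Python. It is not the AMOS algorithm (which is
written in Ciao Prolog and works on dictionary terms); it only assumes that
each package converts one format into another, and the package names and
the function find_assemblies are hypothetical.</p>

<pre>
```python
from collections import deque

# Hypothetical catalogue: each package converts one format into another.
# Package names and formats are made up for illustration only.
PACKAGES = [
    ("latex2pdf", "latex", "pdf"),
    ("latex2dvi", "latex", "dvi"),
    ("dvi2ps", "dvi", "postscript"),
    ("ps2pdf-tool", "postscript", "pdf"),
    ("latex2ps", "latex", "postscript"),
]


def find_assemblies(source, target, packages=PACKAGES):
    """Return every loop-free chain of packages turning source into target."""
    assemblies = []
    # Each queue entry: (current format, chain of package names, formats seen)
    queue = deque([(source, [], {source})])
    while queue:
        fmt, chain, seen = queue.popleft()
        if fmt == target:
            assemblies.append(chain)
            continue
        for name, consumes, produces in packages:
            # Never revisit a format already produced earlier in this chain.
            if consumes == fmt and produces not in seen:
                queue.append((produces, chain + [name], seen | {produces}))
    return assemblies


if __name__ == "__main__":
    for assembly in find_assemblies("latex", "pdf"):
        print(" + ".join(assembly))
```
</pre>

<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
Because the sketch explores chains breadth-first, the shortest assemblies
come out first; a real engine would rank them with preference rules
instead.</p>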
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>If it does all this work, it
will be too slow to be useful, especially if you imagine having
thousands of packages!</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
We have tested the engine by
loading all the package descriptions of a Red Hat distribution and
using the dependencies as capabilities. This test used 700 package
descriptions and around 90,000 relationships (dependencies) among
them, with a dictionary of some 14,000 terms (<b>much</b> more than we
expect a real database to have). This should give a reasonable
approximation of the kind of load that we may end up having. On a fairly
standard PC, quite complex queries, with all the data stored in a
relational database, return results in a few milliseconds (faster
than RPM, taking into account that dependencies are used
recursively). More information, including detailed figures,
can be found in the technical report <i>The Matching Engine Design.</i></p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
One general reason for this speed
is the use of a technique called <i>inverted indexes</i>, which, so
to speak, precomputes part of the search work. This technique has
classically been used in relational databases, and is also at the heart of
the Google web search tool.<br>
<br>
</p>
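<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
As a rough illustration of what an inverted index precomputes, the
following Python sketch maps each dictionary term to the set of packages
described by it, so that a query becomes a set intersection rather than a
scan of every package. The package names and the functions
build_inverted_index and search are assumptions made for this example, not
part of the AMOS engine.</p>

<pre>
```python
from collections import defaultdict

# Hypothetical package descriptions: package name mapped to dictionary terms.
DESCRIPTIONS = {
    "pkg_a": {"image_blurring", "image_scaling"},
    "pkg_b": {"image_blurring", "pdf_output"},
    "pkg_c": {"pdf_output"},
}


def build_inverted_index(descriptions):
    """Precompute, for every term, the set of packages that offer it."""
    index = defaultdict(set)
    for package, terms in descriptions.items():
        for term in terms:
            index[term].add(package)
    return index


def search(index, terms):
    """Return the packages offering all requested terms, via set intersection."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


if __name__ == "__main__":
    idx = build_inverted_index(DESCRIPTIONS)
    print(sorted(search(idx, ["image_blurring", "pdf_output"])))
```
</pre>

<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
The index is built once, when descriptions are loaded; each query then
touches only the entries for the terms it mentions.</p>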
<h2>
OK, let's say that
we have thousands of packages, and our search returns several
possible assemblies. How can we sort through them?</h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
We have given the engine the ability to use heuristics, and it is
designed so that these can be modified or added easily. For example, it
should be possible to prefer sets of packages that are written in the
same language, or sets of packages with the minimum size. We plan to
add to the engine a series of sensible preference rules in order to
guide the search and avoid overwhelming the user with many undesired
matches. As an example, we will add to the engine a special "license"
component that will warn when an assembly mixes GPL and non-GPL
software, to prevent potential licensing problems. This is
only a warning, since, for example, mixing user-level packages is always
allowed.</p>
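<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
The preference rules described above could be sketched as follows. This is
only an illustration in Python under our own assumptions: the package
metadata, the rule functions and the warning text are all hypothetical, and
the real engine expresses its heuristics in Prolog.</p>

<pre>
```python
# Hypothetical package metadata; names, sizes and licenses are invented.
PACKAGE_INFO = {
    "conv_a": {"language": "C", "size_kb": 300, "license": "GPL"},
    "conv_b": {"language": "C", "size_kb": 500, "license": "GPL"},
    "conv_c": {"language": "Perl", "size_kb": 100, "license": "BSD"},
}


def language_count(assembly, info=PACKAGE_INFO):
    """Heuristic: fewer distinct implementation languages is better."""
    return len({info[name]["language"] for name in assembly})


def total_size(assembly, info=PACKAGE_INFO):
    """Heuristic: smaller total size is better."""
    return sum(info[name]["size_kb"] for name in assembly)


# Rules are plain functions in a list, so they can be reordered or extended.
PREFERENCE_RULES = [language_count, total_size]


def rank(assemblies, rules=PREFERENCE_RULES):
    """Sort assemblies by the rules, earlier rules taking priority."""
    return sorted(assemblies, key=lambda a: tuple(rule(a) for rule in rules))


def license_warning(assembly, info=PACKAGE_INFO):
    """Return a warning string if GPL and non-GPL packages are mixed."""
    licenses = {info[name]["license"] for name in assembly}
    if "GPL" in licenses and licenses != {"GPL"}:
        return "warning: assembly mixes GPL and non-GPL packages"
    return None


if __name__ == "__main__":
    candidates = [["conv_a", "conv_c"], ["conv_a", "conv_b"], ["conv_c"]]
    for assembly in rank(candidates):
        print(assembly, license_warning(assembly))
```
</pre>

<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
Keeping the rules in a list mirrors the goal that heuristics can be
modified or added easily; the license check only reports and never removes
an assembly, since it is meant as a warning.</p>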
<br>
<br>

<h2>
<b>You target FLOSS software. What
about your own software?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
The Ciao Prolog system is already
open source software. The AMOS engine is also distributed under the
GPL, and the database and tool information entered into it
will be under the Free Documentation License (FDL).
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>But Prolog?? Why not &lt;Python,
C, C++, Java, name-your-favourite-here...&gt;?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
Modern Prolog compilers (Ciao
Prolog among them) are extremely efficient, and Prolog data
structures are perfectly suited to the task at hand. Also, the
declarative nature of Prolog maps extremely well onto our algorithm,
and there are several nice libraries for generating web pages and
interfacing with databases. If you really, really want C, the Ciao
compiler can output it as its target (not that the result is very
readable, and it will not give you more
capabilities than the Prolog system itself).</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>OK, I'm sold. In one of your
technical papers (<i>The Internal Query Language Design</i>) you
also present an API (Application Programmer's Interface) to call
the engine directly. Is it only for Prolog?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
No. As a direct inheritance from
Ciao Prolog, the engine can be called from Java and C, and more
interfaces are on their way.</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>What about XML? I like XML!</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
XML is nice as a representation
language; it is quite easy to take the database engine dump (that is,
the full content of the database) and output XML. Reasoning over XML
is still a difficult problem, especially in terms of efficiency when
the XML tree is huge. So we prefer to do all the reasoning and the
internal representation in the Ciao internal format, and leave the
task of converting to and from other formats to output plugins.</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>OK, your project seems
interesting. I would like to know more about how searching
works.</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
First, a little introduction to
the dictionary and the ontology (which, in our case, is just another
name for a tree containing all the relations between
packages and the words that describe them). The dictionary is a long
list containing many different "words", or dictionary atoms. For
example, "image_blurring" may be a suitable "word" (even though
it is composed of two English words). To a dictionary atom we can
also attach several synonyms and one generalization: for example,
you can generalize "SQL database" into "database". When the engine
finds no exact match, it uses these
generalizations to try to find an approximate solution. The
dictionary list is quite important in AMOS, and most of the effort in
adding packages to the search engine really goes into finding the
best "words" for describing packages. It is also important
that newly added words do not duplicate ones already present, which
would lead to what we call "dictionary pollution".</p>
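<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
A small Python sketch may help to picture dictionary atoms, synonyms and
the generalization fallback. The data layout, the atom names other than
those mentioned in the text, and the functions canonical and search are
assumptions made for illustration; they do not describe the actual AMOS
data structures.</p>

<pre>
```python
# Hypothetical dictionary atoms: each has synonyms and at most one generalization.
DICTIONARY = {
    "sql_database": {"synonyms": {"rdbms"}, "generalization": "database"},
    "database": {"synonyms": {"db"}, "generalization": None},
    "image_blurring": {"synonyms": {"blur"}, "generalization": None},
}

# Hypothetical package descriptions, written only with dictionary atoms.
PACKAGES = {
    "pkg_store": {"database"},
    "pkg_paint": {"image_blurring"},
}


def canonical(word, dictionary=DICTIONARY):
    """Map a word or one of its synonyms to its dictionary atom, else None."""
    if word in dictionary:
        return word
    for atom, entry in dictionary.items():
        if word in entry["synonyms"]:
            return atom
    return None


def search(word, dictionary=DICTIONARY, packages=PACKAGES):
    """Return (atom, matches); climb generalizations until something matches."""
    atom = canonical(word, dictionary)
    while atom is not None:
        matches = sorted(name for name, atoms in packages.items() if atom in atoms)
        if matches:
            return atom, matches
        # No package uses this atom: fall back to its generalization, if any.
        atom = dictionary[atom]["generalization"]
    return None, []


if __name__ == "__main__":
    print(search("rdbms"))
```
</pre>

<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
In this sketch a query for the synonym "rdbms" resolves to the atom
sql_database, finds no package, and then falls back to the more general
atom database. A word that is not in the dictionary at all returns no
result, which reflects the fact that searches only go through dictionary
words.</p>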
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>"Dictionary pollution"?
What's that?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
A fundamental requirement of our project
is that the dictionary stays manageable, that is, that it
does not grow too much. Pollution is what happens when redundant words
creep in and the dictionary grows needlessly. All searches in AMOS are done only through
words that are in the dictionary. You can type them freely in the
search field, or use the boxes and select among them (using
the Ctrl and Shift keys if you want to select several), but you
can't perform a free-text search (in that case it would be no different from
Google).</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
So it is important to keep
the dictionary size to a minimum, because people performing a
search will be forced to sift through the list, at least in the
beginning.
</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>How do you plan to avoid that?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
By having an administrative
interface that allows our reviewers to decide whether the words are
adequate and, if necessary, suggest alternatives. This administrative
work is part of the project, and after the end of the contract we
plan to open this administrative task to the community as well. We will
do our best to maintain the system for at least 2 years after the end of
the contract, providing connectivity and machines for it, and we
hope to be able to dedicate at least one person to continuing the
maintenance work. We hope that the system will be adopted, and will
be happy to help anyone who wants to deploy it. Everything
contributed will be under the FDL, and we will periodically provide
dumps of it for download.</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>How does the administration
work?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm; font-style: normal;">
When you have filled in a package description, you
simply submit it through the web interface. It is then saved in a
temporary database and reviewed for consistency; the reviewer can
propose modifications, which are sent back to the submitter by
email (including comments). When the submitter returns to the
submission page, she gets back all the filled-in fields and the
comments, and can change them at will and resubmit. Once the submission
is accepted, it goes straight into the database and can be
searched immediately.</p>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm;">
<br>
</p>
<h2>
<b>Will it always be like that?</b></h2>
<p lang="en-US" class="western" align="justify"
 style="margin-bottom: 0cm; font-style: normal;">
If we see that the need for
modifications remains small, in the end we can opt for a
no-administration system, and simply let people enter packages
directly into the database. This decision will probably be left for the end of
the project. Of course, the source is there: if you want,
you can install your own AMOS!</p>
</body>
</html>
